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Abstract 



< 

This paper considers the problem of approximating the inverse of the wave-equation Hessian, 
also called normal operator, in seismology and other types of wave-based imaging. An expansion 
scheme for the pseudodiffcrcntial symbol of the inverse Hessian is set up. The coefficients in 
this expansion are found via least-squares fitting from a certain number of applications of the 
normal operator on adequate randomized trial functions built in curvelet space. It is found 
t-H that the number of parameters that can be fitted increases with the amount of information 

> present in the trial functions, with high probability. Once an approximate inverse Hessian is 

available, application to an image of the model can be done in very low complexity. Numerical 
experiments show that randomized operator fitting offers a compelling preconditioner for the 
ff) linearized seismic inversion problem. 

Acknowledgments. LD would like to thank Rami Nammour and William Symes for intro- 
ducing him to their work. LD, PDL, and NB are supported by a grant from Total SA. 

• Ch 1 Introduction 

1.1 Problem setup: Gauss-Newton iterations 

This paper considers the imaging problem of determining physical characteristics in a region of 
space given surface measurements of scattered waves. Several imaging modalities fall under this 
umbrella (ground-penetrating radar, nondestructive acoustic testing, remote personnel assessment), 
but in the sequel we focus exclusively on the example of reflection seismology. Throughout this 
paper we let m(x) for the physical parameters in the subsurface, and d(r, s, t) for the recorded 
waveforms (seismograms) . Here x are the space coordinates in the volume, r is receiver position, s 
is source position, and t is time. 

A most popular way of treating the inversion problem of recovering m from d is through the 
minimization of the output least-squares functional 
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where T is the nonlinear map for predicting data from the model m. In this paper we restrict 
ourselves to the setup of constant-density acoustics, and let m(x) be a variable wave speed. The 
prediction Tim] then consists of the solutions u - sampled at the receivers (r, t) - of the acoustic 
wave equations 

m(x)u tt - A x u = f s , 

with different right-hand sides f s {x, t) index by s (the source). The notation || • ||| refers to the sum 
of the squares of the components. The quantity m(x) is the inverse of the square of the local wave 
speed. 

Whether data is considered all at once, or by frequency increments as in full waveform inversion, 
the procedure for minimizing J[m] is usually some variant of the Gauss-Newton method, which 
consists in linearizing J[m] about some current vector mo. Specifically, if a new vector m\ is sought 
so that J[mi] is closer to the minimum than J[mo] is, then we first write 

3 J 1 -J 

J[mi] = J[m ] + (^t m oL <W + 2^ m ' t m °] + ••• 

where 5m = m\ — mo, and find 5m as the minimum of the quadratic form above. The solution is 

0= 5^ [mo] + 5m^ [m ° ]6m Sm = -{fo? [mo] ) 6m- [m ° ] - 

This equation is a Newton descent step: it is then applied iteratively to obtain a new TO2 from m±, 
etc. The Hessian is the operator ^[mo]. 

If J[m] is the least-squares misfit functional above, then by denoting 

we obtain the first and second variations of J as 

^-[m ] = -F*(d-T[m ]), 

p^[m ] = F*F-(p^[m ],d-T[m ]). 
om z om z 

The migration operator F* acts from data space to model space, and is most accurately computed 
by reverse-time migration. The demigration operator F acts from model space to data space, and 
can be computed by solving a forward "modeling" wave equation. The term involving the second 
variation of T in the expression of the Hessian is routinely discarded on the basis that T is "locally 
well-linearized" - a heuristically plausible claim when mo is smooth in comparison to 5m - but 
which has so far eluded rigorous analysis. With this simplification in mind, we refer to the (reduced) 
Hessian H as the leading-order contribution 

H = F*F. 

This linear operator is also called the normal operator, and acts within model space. The Newton 
descent step then calls for computing the pseudoinverse F + = (F* F)^ 1 F* , well-known to arise in 
the solution of the overdetermined linearized least-squares problem. 

Physically, inversion of the Hessian corresponds to the idea of correcting for low levels of il- 
lumination of the medium by the forward (physical) wavefield. Although illumination seems to 
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make good sense as a function of space x, it is in fact unclear how to define it as such. Rather, 
it is more appropriate to define illumination as a function in phase-space, i.e., the set of x and k 
(wave vectors). In the words of Nammour and Symes [21], illumination is not just a scaling, but 
a dip-dependent scaling. This paper follows this idea by considering the pseudodifferential symbol 
of the Hessian. 

While reasonably efficient methods of applying the operators F and F* to vectors are common 
knowledge, little is currently known about the structure of the inverse Hessian H . Direct linear 
algebra methods for computing a matrix inverse are out of the question, because the matrix H 
is too large to be formed in practice. This also prevents the immediate application of methods 
such as BFGS. Iterative linear algebra methods such as GMRES or LSQR can be set up, but need 
a very large number of iterations to converge due to the poor conditioning of H. The problem 
of slow convergence is particularly acute since the full prestack data space (after the application 
of F) is much larger than poststack model space - hence each application of H = F*F is very 
costly. The obvious alternative to the Gauss-Newton iteration, namely straight gradient descent 
without considering the Hessian, is even less attractive than GMRES for solving the ill-conditioned 
linearized least-squares problem. 

Preconditioning is needed to properly guide the inversion iterations. A preconditioner for a 
matrix H is a matrix M that approximates the inverse H , It can be used to rewrite Hx = b as 

PHx = PB, 

where now only the matrix PH needs to be inverted. An alternative formulation where P post- 
multiplies H is also possible. Several preconditioners for the wave equation Hessian have already 



been proposed in the literature: they are reviewed in context in Section 1.6 



This paper solves the preconditioning problem by "probing" , or testing the Hessian by applying 
it to a small number of randomized vectors, followed by a fit of the inverse Hessian in a special 
expansion scheme in phase-space. Our work is closest in spirit to that of Nammour and Symes [20, 
|2T] and Herrmann et al. |16j (which in turn follows from a legacy of so-called scaling preconditioners 
reviewed below) but departs from it in that the trial space is randomized instead of being the Krylov 
subspace of the migrated modeQ Randomness of the trial functions guarantees recovery of the 
action of the inverse Hessian on a much larger linear subspace than is normally the case with a 
deterministic method. This claim is backed both by numerical experiments (Section [2J and by a 
theoretical justification (Section [3]). 

The proposed approach bridges a gap in the literature, in that we obtain quantitative results 
- hence finally a rationale - for the probing methods to precondition the wave-equation Hessian. 
We found that randomization is an important step to achieve such guarantees, and may be an 
attractive numerical choice in its own right. 

1.2 Pseudodifferential symbol of the Hessian 

To provide an expansion scheme for the inverse Hessian, it is important to understand its structure 
as a pseudodifferential operator. In the sequel we consider only two spatial dimensions x = (a/, z), 
but the main ideas do not depend on this assumption. 



1 The Krylov subspace of a vector y, for a matrix H, is the space spanned by y, Hy, H 2 y, etc. 
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It is well-known that migration F* is a "kinematic" inverse of the modeling operator F in the 
sense that the mapping of singularities generated by F* generically undoes that of F. Putting 
technical pathologies aside, this claim means that H = F* F does not change the location of 
singularities in model space. Hence the Hessian is "microlocally equivalent" to the identity, or 
"microlocal" for short. This property was understood and made precise by at least the following 
people. 

• In 1985, Beylkin showed that the Hessian is pseudodifferential in the absence of caustics, and 
in the context of generalized Radon transforms [3]. 

• In 1988, Rakesh removed the no-caustic assumption, but considers a point source and full- 
aperture (whole-Earth) data |23J . 

• In 1998, ten Kroode, Smit and Verdel showed that Beylkin's result still holds if a less restrictive 
"traveltime injectivity condition" is satisfied [18] . 

• In 2000, Stolk refined these results by showing that the Hessian is generically invertible: if 
a C°° wave speed does not give rise to a pseudodifferential Hessian, an arbitrarily small C°° 
perturbation of it will [26 . 

The consequence of this body of theory for the problem of designing a compressed numerical 
representation of the Hessian is the following. We will consider a representation of the Hessian as 
a pseudodifferential operator: 

Hm(x) = J e lx - k a(x,k)m(k)dk, (1) 

where hat denotes Fourier transformation in the spatial variables. The amplitude, or symbol o(x, k) 
plays the role of illumination in phase-space (x, k) as alluded to earlier. 

There is nothing special about writing an integral sign instead of a sum: interpolation and 
sampling allow to transform number arrays into functions and vice-versa. Keeping x and k contin- 
uous for the time being however offers the opportunity to discuss the important point: smoothness 
of the symbol a(x,k). Indeed, while the symbol representation ([I]) is always available whichever 
the linear operator considered, the symbol will be smooth in a very specific way for "microlocal" 
operators as discussed above. We say that the symbol a(x,k) is of order e (and type (1,0)) if it 
obeys the condition 

|^fa(x,^)|<^(l + |A;| 2 )( e -H)/ 2 , (2) 
where a = (01,02), d% = J^JjS"; l a l = «i + 02, and similarly for j3. Notice that since the 

bound decreases by one power of (1 + |/c| 2 ) 1//2 for every derivative in k, it means that the larger |A;| 
the smoother the symbol a(x, k). If we had considered the symbol of either F or F* instead, each 
derivative in k space would have increased the value of the symbol by a quantity proportional to 
k. Physically, illumination is a phase-space concept, but it is "not too far" from being purely a 
function of x since the k dependence of a is extremely smooth for large |fc|. 

There is one very idealized scenario in which the Hessian obeys the condition ([2]) with order 
e = 1. The assumptions are the following: 1) sufficiently fine Cartesian sampling of the data in 
time and receiver coordinate (so that the sum can easily be written as an integral), 2) full aperture 
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acquisition, 3) a point-impulse wavelet S(t) in time, and 4) smooth and generic|^]background physical 
parameters such as wave speed. If all these conditions are met, it is known at least from |26j that 
the Hessian has a symbol that obeys Q. 

In turn, if a symbol obeys ([2]), it is by now well-known that it is in fact extraordinarily com- 
pressible numerically. Bao and Symes [1] show that the asymptotic behavior of a(x, k) as k — > 00, 
i.e. the action of the Hessian at very small scales, can be encoded using only a few Fourier series 
coefficients in x and in = arg k: 

a(x,k)^Y J Cx, n e iX - x e^\k\ (3) 

Recent work by Demanet and Ying [11] has shown how to add degrees of freedom in the radial 
wave number variable \k\ to obtain an e-accurate expansion of a(x, k): 

a(x,k)= c Xtquq2 e lX - x e^ e TL q2 (\k\)\k\+0(e), (4) 

A,(?1,<J2 

where the TL are rational Chebyshev functions. The number of terms in the sum is a 0(e~ M ) for all 
M > 0. Other expansion schemes exist, such as the hierarchical spline grids in k space, considered 
in [TT]. In practice, symbols are considered for values of k that obey max{|/ci|, I&2I} < vr-^/n for some 
large n. In view of the Shannon sampling theorem, this restriction corresponds to sampling (2D) 
functions on a square grid as vectors of length n, and operators such as the Hessian as matrices of 
size n-by-n. Both ([3]) and Q are good approximations of the symbol a(x, k) in the sense that they 
each contain a number of terms independent of the size n of the matrix that eventually realizes the 
Hessian. 

In three spatial dimensions, spherical harmonics would be used in place of complex exponentials 
in angle. Otherwise, the symbol expansion scheme needs not be changed. 

Equation Q provides a decomposition of H into "elemetary operators" Bi, each with symbol 
e e mS TLq(\k\) \k\. The index i is a shorthand for (A, (71,(72), and accordingly we let bi for the 
coefficients c\ qi m . In this more compact notation we have the fast-converging expansion 

for the Hessian. 

It is not the Hessian that is of interest, but rather the inverse Hessian. Fortunately, it is a result 
of Shubin [25] that if the symbol a(x, k) of an operator obeys Q, and if this operator is assumed 
to be invertible, then the symbol b(x, k) of the inverse of the operator also obeys ([2]), namely 

\d%d?b{x,k)\ <D a p(l + \k\ 2 )(- l -W/ 2 , (5) 

with constants D a p that are possibly different from C a p. Notice that the order is now —1. In 
other words, smoothness of the symbol is preserved, or closed, under inversion. If the operator is 
invertible but only barely so (small singular values which are not regularized), then the constants 

2 As above, "generic" here refers to the absence of kinematic exceptions that would discredit migration as a 
microlocal inverse, as discussed in [26]. Smooth means infinitely differentiable with oscillations on a length scale 
much larger than the wavelength of the wave. Random smooth media are "generic" with probability 1. 
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D a p may become large, but the behavior under differentiations in k space is still controlled by ([5]). 
Note in passing that b(x, k) is not exactly given by l/a(x, k), but the latter is an approximation of 
b(x, k) that mathematicians find satisfying when \k\ is large. 

Using the same expansion scheme as above we write 

i 

with different coefficients Cj. 



1.3 Ill-conditioning 

The four assumptions on the sampling, aperture, wavelet, and medium enumerated earlier are of 
course far from being realistic in practice. Their violation invariably creates ill-conditioning in the 
form of a linear subspace in model space where applying the Hessian will return very small values. 
This issue manifests itself as small values of the symbol a(x, k). For instance (and this may not be 
an exhaustive list), 

• Limiting the sampling and the aperture will create an angular deficiency in the sense that 
reflectors with certain orientations will not be visible in the dataset. The symbol a(x, k) will 
take on small values for the kinematically "invisible" x and k. 

• Restricting the wavelet in w space (frequency) will have the effect to remove low and high 
wavenumbers from the data. This will have the effect of restricting the symbol a(x,k) in 
wave number \k\. 

• Finally, complicated kinematics of the background wave speed(s) may create shadow zones 
in which there is very poor illumination. Such is the region behind an impenetrable sphere. 
In that case the symbol a(x, k) becomes very small in those inaccessible regions. 

The subspace of model space in which the Hessian produces small values is a numerical version 
of its nullspace Because this subspace is nonempty, not all vectors in model space are accessible 
from applying the Hessian to some other vector: the range space does not have full dimension. In 
other words, the Hessian does not have full rank. Fj It is well-known from linear algebra that the 
dimension of the (numerical) nullspace is equal to the codimension of the (numerical) range space. 
Because the Hessian is symmetric, the range space is in fact orthogonal to the nullspace - and ditto 
of their numerical versions. Figure [T] depicts the fundamental subspaces of the Hessian. 



3 The "numerical nullspace" is precisely denned as the span of the right singular vectors corresponding to singular 
values below some threshold. 

4 The "numerical range space" is precisely defined as the span of the left singular vectors corresponding to singular 
values above some threshold. 
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Figure 1: The fundamental subspaces of the Hessian (or of any Hermitian matrix). The nullspace 
is the set of vectors in model space to which an application of the Hessian produces zero values, 
and whose information is lost. The range space is the set of vectors which are the image of some 
other vector through an application of the Hessian — its dimension is the rank of H. The blue 
arrows indicate that, under the action of H, the whole space gets mapped to the range space, while 
the nullspace gets mapped to the origin. 



In spite of these complications, this paper speculates that for models well inside the range space 
of the Hessian, an estimate like ^ for the inverse Hessian holds. It is not currently known whether 
this is true theoretically, but we show numerical evidence that supports the claim. 

1.4 Randomized fitting 

We now address the question of fitting the coefficients in an expansion scheme for the symbol of 
the inverse Hessian, from application of the Hessian on randomized trial functions. For the time 
being we assume that the Hessian is invertible and well-conditioned; we return to the discussion of 
the nullspace in the next section. 

Assume that the inverse Hessian is an re-by-n matrix that can be expanded as 

p 



H- 1 =Y^aBi, (6) 



i=i 

where Bi are themselves matrices, and p counts the number of terms. One possible choice for the 



Bi was given in Section 1.2 (up to discretization), but here the discussion is general. Denote by 
y a vector of independent and identically distributed (i.i.d.) Gaussian random variables, in model 
space - a "noise" vector. The application of the Hessian to y is available: 

x = Hy 44> y = # _1 x. 

Given this information, we may now solve for the coefficients c« in 



p 

yj = ^2c i (B i x)j, j = l,...,n. 

i=l 
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This linear system can be overdetermined only if p < n; in that case the least-squares solution is 

Ci = ^2(M~ 1 ) ij (B j n.) k y k , 

where 

The coefficients c$ can therefore be solved for, in a unique and stable manner, provided the matrix 
M is invertible and well-conditioned. As we show in the sequel, the invertibility of M hinges on 
two important assumptions on the elementary matrices Bf. 

1. The Bi obey an fZ-dependent near-orthogonality relation: 

EMij = Tt(HBJ BjH) ~ Sij, 

which we express more precisely as requiring that EMjj be positive definite. The symbol E 
stands for mathematical expectation, or "average over an infinite number of random realiza- 
tions" . 

2. Each Bi is a full-rank (invertible), well-conditioned matrix. 

When those two conditions are met, we show in section [3] that is an invertible matrix, with 
high probability, provided p is large enough, on the order of the square root \fr of the rank r of 
M. This result may not be tight but has the advantage of motivating the two assumptions above. 
We suspect that the number p of coefficients q that can be fitted with this method is in fact closer 
to a constant times r/log 2 r - this will be the subject of a separate study. 

The expansion schemes in equations ^ and Q correspond to matrices Bi that obey the above 
conditions. 

Notice that if the expansion ^ is accurate, i.e. that H~ l is determined as a linear combination 
of the Bi, then the proposed method recovers the whole matrix H~ l in compressed form, not 
just the action of the matrix H^ 1 on the trial vector x. This property is important: we call it 
generalizability. The action of H~ l can be reliably "generalized" from its knowledge on x, to other 
vectors. The randomness of the vector y is essential in this regard: it would be much harder to 
argue generalizability if the vector y had been chosen deterministically. The numerical experiments 
in section [2] confirm this observation. 

Finally, it is worth noting that H~ l needs not be given exactly by a sum of p terms of the 
form CiBi. If the series converges fast instead of terminating exactly, it is possible to show that the 
coefficients Cj are determined up to an error commensurate with the truncation error of the series. 

1.5 Fitting via randomized curvelet-based models 

As mentioned earlier, inversion of the wave-equation Hessian is complicated by various factors that 
create ill-conditioning. The lack of invertibility not only prevents randomized fitting to work as 
presented in the previous section, but it also adds to the numerical complexity of the inverse Hessian 
itself. Just being able to specify the numerical nullspace - the subspace in which the Hessian erases 
information - is at least as complex as specifying the action of the inverse Hessian away from it. As 
a consequence, it may be advantageous for a coarse preconditioner not to explicitly try and invert 
the Hessian in the neighborhood of the numerical nullspace. 
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Our solution to the ill-conditioning problem is to consider noise realizations y that avoid the 
nullspace, i.e., belong to the range space of H. The relation y = H~ l ~x. then makes sense if we 
understand H' 1 as the pseudo-inverse of H. The numerical nullspace of H is best described in 
phase space: it corresponds to the points (x, k) where the symbol a(x, k) of H is small. This calls 
for considering an illumination mask, i.e., a simple 0-1 function which indicates whether a point 
(x,k) is in the essential support of the symbol (value 1) or not (value 0). This piece of a priori 
information is then used to filter out components of the noise vector (in x space) which would 
otherwise intersect the nullspace of the Hessian. 

An explicit expression for the pseudo-differential operator H can be obtained in the idealized 
case of densely sampled data with idealized sources and receivers. The process involves the asymp- 
totic expansion(stationary phase analysis) of a Generalized Radon Transform and is described in 
[3]. We use this expansion as a way of isolating the null space of H. 

Concretely, we built this illumination indicator function in curvelet-transformed model space. 
Curvelets are directional generalizations of wavelets which are efficient at representing bandlimited 
wavefronts in a sparse manner [5] [33] , and have had applications for regularizing the inversion in 
seismic imaging [51 [16], [T7] . They also provide a sparse representation of wave propagators [4J. Each 
curvelet tpn(x) is indexed by a position vector x^ and a wave vector k^. Any (square- integrable) 
function / can be expanded in curvelets as 



space where the symbol of the Hessian takes on different values. 

Consider S, the set of curvelets cp^ whose center {x^k^ belongs to the essential support of 
the symbol a(x, k) of the Hessian. The stationary phase analysis mentioned above [31 [28] reveals 
the geometric interpretation of these phase-space points: they are visible, in the (microlocal) sense 
that there is a ray linking some source s to the point x, reflecting at x in a specular fashion about 
the normal vector k, and then linking x back to some receiver r. When a curvelet is visible, it 
means that it acts like a "local reflector" for some waves that end up being observed in the dataset. 
More precisely, a phase-space point (x^, k^) belongs by definition to S if there exist two rays j s , j r 
originating from x^ such that: 

• 7 S links x^ to some source in the source manifolcj^J 

• 7 r links x^ to some receiver in the receiver manifold; and 

• 7 r is a reflected ray for 7*, at x„, i.e., the angle of incidence is equal to the angle of reflection 
and the two rays form a plane with the normal direction fc„. 

The rays are obtained by ray-tracing from the Hamiltonian system of geometrical optics. The 
illumination mask is then the sequence equal to 1 if fi £ S, and zero otherwise. A noise realization 
y in curvelet space, filtered by the illumination mask, is simply 



Conventionally, an interval or an otherwise open set of positions in which the sampling of sources (resp. receivers) 
is dense enough in view of the typical wavelength of the seismic waves. 





curvelets efficiently discriminate between different regions of phase- 
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The sequence y M is then inverted to yield 

The rest of the algorithm for determining the inverse Hessian then proceeds as in the previous 
section. 

Once the inverse Hessian is available as a series ([6]), the algorithm for applying it to a vector 
like the migrated model is well-known and very fast [IJ [11] . 

1.6 Previous work 

Being able to extract information on the inverse Hessian from a single application of the Hessian is 
a very good idea which perhaps first appeared, in seismology, in the work of Claerbout and Nichols 
[BJ. There, a single scalar function of x is sought to represent inverse illumination. In our notations, 
they seek to fit a symbol b(x, k) which is not a function of k. 

This work generated refinements that W. Symes puts under the umbrella of "scaling methods" . 
In 2003, Rickett [24J offers a solution similar to that of Claerbout and Nichols. In 2004, Guitton 
|13j proposes a solution based on "nonstationary convolutions" which corresponds to considering 
a symbol b(x, k) which is essentially only a function k. In 2008, Symes [27[ proposes to consider 
symbols of the form 

b(x,k)~f(x)\k\-\ 

i.e. which have the proper homogeneity behavior in \k\. In 2009, Nammour and Symes \20\ [2Tj 
upgrade to the Bao-Symes expansion scheme given in equation ([3]). In 2009, Herrmann et al. [TBJ 
propose to realize the scaling as a diagonal operator in curvelet space. 

In all these papers, it is the remigrated image to which the inverse Hessian is applied; in contrast, 
our paper uses randomized curvelet trial functions. For the representation of the inverse Hessian, 
we use both ^ and Q for its symbol. 

It should also be noted that Herrmann et al. [15] already proposed in 2003 to realize a curvelet- 
diagonal approximation of the Hessian, obtained by randomized testing of the Hessian. 

The idea of recovering a matrix that has a given sparsity pattern or some other structure from 
a few applications on well-chosen vectors ( "probing" ) also appeared in the 1990 work of Chan and 
Keyes on domain-decomposition preconditioning for convection-diffusion problems [7]. See also the 
1991 work of Chan and Mathew [8]. 

The related idea of computing a low-rank approximation or "skeleton" of a matrix by means of 
randomized testing, albeit without a priori knowledge of the row and column spaces, was extensively 
studied in recent work of Rokhlin et al. \19\ [31], and Martinsson and Tropp [14 . 

2 Numerical results 

The classical Marmousi benchmark example is the basis of all our numerical experiments. The 
forward model is taken to be the linearized wave equation 

mo(x)utt - Ait = -5m(x)(uo)tt, 
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where the incident field uq obeys 



m (x)(uo)tt ~ Au = f s , 

with f s (x,t) = 5(x — s)w(t). The wavelet w(t) is taken to be the second derivative of a gaussian 
(Ricker wavelet). The background medium mo is either taken to be constant (in Sections 2.1 2.2), 
or a smoothed version of the original Marmousi model with various degrees of smoothing (in Section 



2.3). The data d(r,s,t) are then collected as the samples of u at receiver positions r and source 



positions s at the surface z = 0, and all adequate times t. 



The same equations are then used for the imaging, with thq and f s assumed known, but not 
5m(x). This is known as the "inversion crime", as any real-life imaging application would also 
require to solve for mo and f s - problems that we leave aside in this paper. Notice also that the 
forward model is linear in 5m(x), a clearly uncalled-for assumption in practice since it neglects 
multiple scattering. A better wave equation for u would have uu in place of {uq)u in the right- 
hand-side. We nevertheless made this assumption so as not to obscure the fact that the Hessian is 
intrinsically present to correct the solution of the linearized inverse problem. 

For the convenience of being able to run hundreds of simulations in a matter of hours, we 
choose to consider a 2D problem on a square domain with N 2 points, N = 127 for most of the 
results shown. A perfectly matched layer (PML) of width .15iV surrounds the domain of interest. 
The numerical method has spectral differences in space, and second-order differences in time. The 
poststack imaging operator F* performs a stack on three sources maximally spaced from each other 
(albeit not in the PML). More sources were used in some of the numerical experiments, but this 
did not significantly affect the inverse Hessian. As is well-known, the main advantage of using more 
sources is the robustness to noise. (All the imaging results are robust to additive gaussian white 
noise, but not to purely multiplicative gaussian white noise.) 

Two types of preconditioners are compared: 

• Rn: Fitting of the inverse Hessian from randomized curvelet trial functions. This precondi- 
tioner is denoted as Rn where n is the number of trial functions used for the fitting, e.g. R4 
is four functions were used. 

• Kn: Fitting the inverse Hessian from trial functions taken in the Krylov subspace of the 
migrated image. This preconditioner is denoted Kn where n is the number of trial functions 
used for the fitting, e.g. K2 if both the migrated image and the remigrated image were used. 
This is essentially the method of Nammour and Symes |20^ [21] , with the slight improvement 
of using the full expansion Q in place of ^ - a minor point. 



In both cases the Bi are the elementary symbols of equation Q. Different numbers of terms 
are tested in this pseudodifferential expansion: in order of decreasing importance, the parameters 
are 1) number of Fourier modes in x, 2) number of Fourier modes in the wavevector argument 8, 
and 3) number of Chebyshev modes in the wavenumber \k\. The right balance of parameters in 
each dimension was obtained manually for best accuracy; only their total number (their product) 
is reported. 

The action of the preconditioners on the migrated image F*d is compared to the image obtained 
after 200 gradient descent steps for the (linearized) least-squares functional. The refinement of this 
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brute force method to an iterative solver such as GMRES or LSQR is important in practice, but 
was not investigated in the scope of this paper. 

Errors between models are measured in the relative mean-squared sense, i.e. if 5m\ is a reference 
model and 5m,2 another model, then 

MSE(gm 1 ^m 2 )= l|J 7"^ 112 . 



2.1 Basic results 

The action of the preconditioners on the migrated image is satisfactory: as the figures below show 
it is visually closer to the image obtained after 200 gradient steps than the migrated image. 




Figure 2: Left: oscillatory wave speed profile ("reflectors") used to produce wavefield data. The 
forward model is the linearized wave equation with a unit background speed. Middle: migrated 
image, obtained by reverse-time migration. Right: image obtained by 200 gradient descent steps 
to solve the linearized least-squares problem. 



Figure 3: Left: a randomized curvelet trial function, used for testing the Hessian in order to fit 
the inverse Hessian. Middle: image obtained by applying the R4 preconditioner to the migrated 
image. Right: image obtained by applying the Kl preconditioner to the migrated image. 

The Krylov preconditioner Kl usually works well on the migrated image. The randomized 
preconditioner Rl is often a notch worse than Kl, but when going up to R4 and higher the 
performance becomes very comparable to Kl. We did not find an instance where any Rn, regardless 
of n, would significantly outperform Kl (a puzzling observation). However, we notice in Figure [5] 
that as the dimension of the Krylov subspace increases, the performance of K2, K3, etc. deteriorates 
very quickly. This is in contrast to what was advocated in |20| I2T] . 

There is a sweet spot in the number of parameters in the symbol expansion of the inverse 
Hessian, around 500 to 1000 for the numerical scenario considered. See Figure |4j If the number 
of parameters is too small, the inverse Hessian is not properly represented. If the number of 
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parameters is too large, they are either not used to improve the representation of the Hessian, or 
their large number leads to ill-conditioning of the fitting problem (hence large numerical errors.) 



> 
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Figure 4: Relative MSE of the Rl, R7 and Kl preconditioners, as a function of the number of 
parameters. Left: preconditioned migrated image vs. slightly modified "true" image of Figure [2j 
left. By "slightly modified" , we mean that a curvelet mask is taken to only measure the components 



of those images in the set S (see section 1.5). Right: preconditioned migrated image vs. recovered 
image of Figure [2j right. No curvelet mask is taken here. Any MSE below 1 (100 percent relative 
error) indicates that the preconditioning is working. 



Note that in this experiment the Hessian is a 16, 384-by-16, 384 matrix. Its numerical rank 
hovers in the few thousands; more precisely, for a top singular value normalized to unity, the e- 
rank as a function of e is given by the following table. We attribute the rank deficiency mostly to 
the perfectly matched layer (PML) and other windows applied. 



e 


e-rank 


le-1 


435 


le-2 


1367 


le-3 


2164 


le-4 


2803 


le-5 


3250 


le-6 


3624 



2.2 Generalization error 

The Rn preconditioners show their true potential when the inverse Hessian is applied to another 
randomized trial function, drawn independently from those used for fitting the symbol, see Figure 
[5j right. Generalizability to a large linear subspace of models is as the theory predicts. The Krylov 
preconditioners, on the other hand, show some fragility here. They are not designed to work when 
applied on images far from the remigrated image, and indeed, the error level is higher for Kl than 
for any Rn. 
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Figure 5: Relative MSE of generalization. The setup is the same as in the previous section, 
except that the Marmousi model was replaced by an (indepently drawn) randomized curvelet trial 
function. Here the error is simply the relative MSE for the reconstruction of this randomized trial 
function, from applying the Hessian followed by a preconditioner. The x axis shows the number 
of parameters. Left: Kl, K4, and K7 preconditioners applied to the randomized trial function, 
vs. image obtained by 200 gradient descent iterations. The performance quickly degrades with the 
order of the Krylov subspace. Right: Rl, R4, and R7 preconditioners applied to the randomized 
curvelet trial function, vs. reference image. Notice that the error is smaller than in the Krylov 
case, and that the performance does not degrade with the index of the preconditioner. 

The degradation of the Kn preconditioners as n increases is understandable. In applying the 
normal operator 1,2, ... , n times to the migrated image, information is lost in all but the eigenspaces 
corresponding to leading eigenvalues. This is well-known from the analysis of the power method in 
linear algebra. As a result, the disproportionate weight lent to those subspaces "hijacks" most of 
the degrees of freedom of the symbol expansion and prevents a good fit. 

The robustness of the Rn preconditioners offered by generalizability may be useful in the scope 
of preconditioned gradient descent iterations. While H^ 1 is applied to F*d (migrated image) in 
the first iteration, it is subsequently applied to F*(d — F 5mk) (migrated residual). The latter will 
deviate from F*d in the course of the iterations, resulting in a weaker Kl preconditioner. 

2.3 Variable media 

The curvelet mask used in the definition of the randomized trial functions is a set S in curvelet 
space indicating whether the curvelet is "visible in the dataset" or not. In the case of a uniform 
medium, this information is obtained by considering the fan of couples of lines originating from 
each curvelet's center point, for which the angle of incidence equals the angle of reflection. For a 
given curvelet the test is whether one of the lines joins the curvelet to a source while the other 
line joins the curvelet to a receiver. If this test returns a positive match for one couple of lines, we 
declare that the curvelet is active and its index belongs to the set S. 

In the case of smooth variable media, the test is similar but now involves ray tracing, i.e., 
computing the trajectories of the Hamiltonian system of geometrical optics. This is performed 
ray-by-ray using the high-order adaptive Runge-Kutta time integrator ode45 built in Matlab. Ray- 
tracing is normally not a computational bottleneck; if solving for the rays one-by-one is too slow, 
a fast algorithm such as the phase-flow method of Ying and Candes [32] can be set up to speed up 
the process. 

For the numerical experiment we take the smooth part of the Marmousi model M{x) and 
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smooth it further by convolution with a radial bump. This operation is realized in the wavevector 
domain, by multiplying the Fourier transform of M(x) by the indicator function of a disk of radius 
rN (the whole wavevector space is a square of sidelength N). We let < 7 < 0.4 and consider 
Mj(x) the further-smoothed Marmousi background model velocity. Then we set 

m (x) = (l - q^) / M(x)dx + ^M 7 (x). 

If 7 = we recover a uniform medium. The MSE of the R5 preconditioner as a function of 
< 7 < 0.4 is shown below. Most of the numerical tests performed in the earlier sections were 
repeated in variable media: we did not find that any particular plot was worth reporting, as the 
performance systematically degrades in a predictable manner as 7 increases. 




27 121 538 2401 

(J of parameters (log scale) 

Figure 6: Relative MSE of the R5 preconditioner applied to the migrated image, vs. the image 
obtained by 200 gradient descent iterations (Figure [2j right). The x axis shows the number of 
parameters. The different curves refer to different smoothness levels of the model velocity, as 
explained in the text. 

2.4 Other tests 

Other sizes, from N = 64 to N = 256 were tested and showed similar performance levels. 
Other randomized trial functions than "curvelet-masked noise" were attempted, such as 

• Gaussian white noise in model space, which failed badly because it contains too much energy 
in the nullspace, with high probability. 

• Gaussian white noise in data space, migrated to model space. Such trial functions still have 
too much energy in the nullspace and led to unequivocally poor results. 

• Gaussian white noise in model space, to which the normal operator is applied. These trial 
functions work well, and show error levels comparable (at times slightly worse) than the 
curvelet trial functions. They have the advantage of being simple to define - no need for 
curvelets - but more complicated to compute as each randomized trial function requires one 
application of the expensive Hessian. 

• Gaussian white noise in model space, to which the normal operator is applied, followed by 
a diagonal operation in curvelet space where the coefficient magnitudes are either put to 1 
or to zero if they are under a small threshold. Coefficient phases are unchanged. These trial 
functions are comparable to the simpler ones defined directly in curvelet space. 
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• Other distributions than gaussian for the noise: this did not give rise to any noticeable 
difference in our numerical experiments. Lemmas are indeed often available to pass from one 
distribution to the other in large deviation theory. 

The fitting of the inverse Hessian was also realized from an application of the Hessian to the 
desired unknown model that served to create the data. This operation can of course not be 
performed in practice since we are precisely trying to invert for this unknown model. But the 
numerical experiment is very instructive: it shows that the relative MSE of the Rn preconditioner 
applied to the migrated image decays to such small values as 0.1 when the number of parameters 
is large enough; the MSE does not stall on a plateau at 0.3 like it does in all the figures above. 
This goes to show that the pseudodifferential expansion is instrinsically good, but that neither the 
Krylov fit nor the randomized fit is fine enough to predict the right coefficients. This leaves exciting 
room for improvement of the method. 



3 Theory 

3.1 Invertibility of M 



To carry out the least squares minimization in Section 1.4, the n by n matrix M has to be well- 
conditioned. In this section, we will show that this happens with high probability (whp) when the 
number of parameters p is related to the (numerical) rank r of H through 

r > Cp 2 logp, for some C > 0. 

If H were an invertible matrix, we would simply let y ~ N(0, l) n , independent and identically 
distributed (iid). But in the general case, and as mentioned earlier, we should make sure that y 
is properly "colored" to avoid the nullspace of H. While our numerical solution to this problem is 
approximate, we will assume for simplicity that we can exactly project y onto the range space of 
H, 

y = Py, 

where P is the orthogonal projector onto Ran(-ff). 

The random matrix M to invert for the fitting step is then 

Mij = y T (H Bj BjH)y . 

It holds that My = y T (HBj BjH)y without the tildes, hence 

EMij = Tr(HBfBjH). 

It is assumed that EM is positive definite and well-conditioned; our argument consists in showing 
that M does not depart too much from its expectation whp. 

Let || • || and || • \\p denote the spectral and Frobenius norms respectively. We denote by n the 
condition number of EM, 

k = ||EM|| || (EM) -1 1| . 
We also need to consider rj > 0, the smallest number such that 

uniformly over i. We may call rj the "weak condition number" of the collection of HB{. 
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Both k and r\ are greater than 1, but it will be manifest from the way they enter the estimates 
below that they ought to be small (close to 1). If r] is small, then HBi has approximate numerical 
rank r, i.e., the largest r singular values are comparable in size. 

The following result is a perturbative analysis quantifying the size of ||M — EM|| in relation to 
||EM||. 

Theorem 1. Assume that H is a symmetric rank-r matrix that can be written as H = Y^=i c iBi- 
Define M y - and rj as above. For all < e < 1, there exists a number C(e, rj) > such that, if 



r > C(e, r/)p 2 logp, 



then with high probability 

\\M - EM\\ < e||EM||. 
Explicitly, C{e,rj) = 160?7 4 e _2 , and the "high probability" is at least 1 — 2p~ s . 

Before we prove this theorem, let us explain how invertibility of M follows at once. Since the 
condition number of EM is k, its minimum eigenvalue obeys 

A min (EM) > - ||EM||. 

When a matrix is perturbed, the change in eigenvalues is controlled by the spectral norm of the 
perturbation, so 

Amin(M) > Q -e\ ||EM||, whp. 

It suffices therefore to apply the theorem above with e < ^ to ensure invertibility of M. 

Proof of Theorem^ Let us first settle that My = y T {HBj BjH)y , without the tildes. It suffices 
to argue that HP = H. By transposition, and symmetry of both H and P, it suffices to show that 
PH = H. This latter equation is obviously true since P acts as the identity on the range space of 
H. 

Now let L = ||EM||. Our proof considers the statistics of My- element-wise as a quadratic form 
of the gaussian random vector y. We will show that My- is highly unlikely to be more than eL/p 
away from EMy. In what follows we use the || • ||i and || • ||oo induced matrix norms - the maximum 
absolute column and row sums respectively. If we can show that |Mjj — EMy| < eL/p for all 
then the following inequality completes the proof: 

||M - EM || 2 < ^ (\\ M ~ + \\ M ~ EM Hoc) < eL. 

The statistics of quadratic forms y T Ay were perhaps first completely studied by Grenander, 
Pollak and Slepian [12]. In a nutshell , the variance of the quadratic form M^ = y T (HBf BjH)y 
is known to be proportional to \\\{HBj BjH + HBj BiH)\\ 2 F . We seek to bound these variances 
using the fact that the HBi are "weakly well-conditioned" . 

Fix i. We know that 

L = ||EM|| > |EAf«| = \Tr(HB?BiH)\ = 

Using the definition of r\ we obtain a stronger bound on the spectral norm, namely ||-ffi?i|| < 
rj 1 1 HB{ \\f / \fr < Vy/L/r. The implication is that for all i, j, 

\\HBjBjH\\2 < r, 2 L/r. (7) 
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As for the Frobenius norm of HBf BjH, we make use of the fact that H has rank r to bound 

\\HBj BjH\\p < \\HBjBjH\\^ = rj 2 L/^. (8) 

We are now ready to bound Pr(|Mjj — EMj,-| > eL(p)/p). For clarity, fix i,j and let A = 
HBf BjH . The standard deviation of A is proportional to which by Eq. ^ is roughly on 

the order of L/y/r or L/p. This is qualitatively correct. For an explicit bound, we refer to Bechar 
[2], who builds on the work of |12| to state the following. 

Lemma 1. Let A G R nxn and y ~ N(0, 1)" iid. Then for any A > 0, 

Pr(|y T yly-Ey T ^y| > ||A + A T || F \/A + 2||4||A) < 2exp(-A). 

We pick A = 10 log p. It is straightforward to verify that with this choice of A, with the definition 
of C(e,n), and with equations Q and ([8]), we have 

\\A + A T \\ F V\ + 2|| A|| 2 A < eL/p. 

It follows that 

¥r{\Mij - EM i:j \ > sL(p)/p) < 2exp(-A) < 2p~ 10 . 

An union bound over p 2 pairs of i,j's concludes the proof. Note in passing that we made no 
effort to minimize C{e,rj). 

□ 

Finally, we sketch a standard procedure to handle complex-valued matrices. Instead of taking 
the symmetric part of A by y T Ay = y T {]^{A + A T ))y, decompose it into Hermitian and anti- 
Hermitian components, that is y T Ay = y T ^4iy — iy T A2y where A\ = \{A + A*) and A2 = 
I ( A — A*) are both Hermitian. Then bound the deviations from their expectations separately by 
\y T Ay — Ky T Ay\ < \y T A\y — Ky T Aiy\ + \y T A2y — Ey T A2y\. Repeat similar arguments and invoke 
Lemma [T] to show that each term is less than eL/2p whp. 

3.2 Rationale for curvelets 

The success of the proposed method for inverting the Hessian depends on the property of phase- 
space localization of curvelets. Good localization of a basis function like a curvelet near a point 
(x, k) implies that it will only "see" values of the symbol a(x, k) near that point, when acted upon 
by the Hessian. 

The following result makes this heuristic precise; it is a minor modification of a theorem of 
Stolk [T7] so the proof is omitted. 

Theorem 2. (Stolk, 2008). Let a(x,k) be the pseudodifferential symbol of the wave equation 
Hessian H , as in equation |Ip ; and assume that it obeys with m = 1. Consider the zeroth-order 
symbol a(x,k)\k\~ 1 of the operator H(-A)^ 1 ^ 2 . Denote by H(— A) -1 / 2 the diagonal approximation 
of H(-A)^ 1 / 2 in curvelet space, with the sampled symbol as multiplier, 

H{-A)- l ' 2 f = ^^(x)a(x M ,^)|fc M |- 1 f Tp lx {x)f{x)dx. 

If f obeys f{k) = for \k\ < k m \ n , then there exists C > such that 

\\(H-H)(-A)-V 2 f\\2<-C=\\f\\2. 
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In other words, the more oscillatory the model f{x) the better the diagonal approximation of 
the Hessian via curvelets. Hence the larger k the better the "probing" character of a curvelet near 
its center in phase-space. 

The theorem above is also true for another frame of functions, the wave atoms of Demanet and 
Ying [TU], but would not be true for wavelets, directional wavelets, Gabor functions, or ridgelets. 

4 Conclusion 

This paper presents a preconditioner for the wave equation Hessian based on ideas of randomized 
testing, pseudodifferential symbols, and phase-space localization. Numerical experiments show that 
the proposed solution belongs to a class of effective "probing" preconditioners. The precomputation 
only requires applying the wave equation Hessian once, or a small number of times. 

Fitting the inverse Hessian involves solving a small least-squares problem, of size p-by-p, where 
p is much smaller than n and the Hessian is n-by-n. Even if p were on the order of n the proposed 
method would be very advantageous since constructing each row of the Hessian requires going back 
to the much higher dimensional data space. 

It is anticipated that the techniques developed in this paper will be of particular interest in 
3D seismic imaging and with more sophisticated physical models that require identifying a few 
different parameters (elastic moduli, density). In that setting, properly inverting the Hessian with 
low complexity algorithms to unscramble the multiple parameters will be particularly desirable. 
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